Abstract

We address the task of annotating images with semantic tuples. Solving this problem requires an algorithm that can deal with hundreds of classes for each argument of the tuple. In such contexts, data sparsity becomes a key challenge, as there will be a large number of classes for which only a few examples are available. We propose handling this by incorporating feature representations of both the inputs (images) and outputs (argument classes) into a factorized log-linear model, and exploiting the flexibility of scoring functions based on bilinear forms. Experiments show that integrating feature representations of the outputs in the structured prediction model leads to better overall predictions. We also conclude that the best output representation is specific to each type of argument.

1 Introduction 

Many important problems in machine learning can be framed as structured prediction tasks where the goal is to learn functions that map inputs to structured outputs such as sequences, trees or general graphs. A wide range of applications involve learning over large state spaces, i.e., if the output is a labeled graph, each node of the graph may take values over a potentially large set of labels. Data sparsity then becomes a major challenge, as there will be a potentially large number of classes with few training examples.

Within this context, we are interested in the task of predicting semantic tuples for images. That is, given an input image we seek to predict the events or actions (predicates), who and what the participants of the actions are (actors), and where the action is taking place (locatives). Fig. 1 shows two examples of the kind of results we obtain. To handle the data sparsity challenge imposed by the large state space, we leverage an approach that has proven useful in multiclass and multilabel prediction tasks [?]. The main idea is to represent a value for an argument $a$ using a feature vector representation $\phi \in \mathbb{R}^n$. We will later describe in more detail the actual representations that we used and how they are computed, but for now imagine that we represent an argument by a real vector where each component encodes some particular property of the argument. We will integrate this argument representation into the structured prediction framework.

More specifically, we consider standard factorized linear models where the score of an input/output pair is the sum of the scores, usually called potentials, of each factor. In our case we will have unary potentials that measure the compatibility between an image and an argument of a tuple, and binary potentials that measure the compatibility between pairs of arguments in a tuple. Typically, both unary and binary potentials are linear functions of some feature representation of the input/output pair. In contrast, we will consider a model that exploits bilinear unary potentials of the form $v_y^\top W x$, where $v_y \in \mathbb{R}^n$ is some real vector representation of an argument $y \in L$ and $x \in \mathbb{R}^d$ is a $d$-dimensional feature representation of an image. Similarly, the binary potentials for a pair of arguments $(y, y')$ will be of the form $v_y^\top Z v_{y'}$. The rank of $W$ and $Z$ can be interpreted as the intrinsic dimensionality of a low-dimensional embedding of the input and argument feature representations. Thus, if we want computationally efficient models (i.e., few features) it is natural to use the rank of $W$ and $Z$ as a complexity penalty. Since using the rank would lead to a non-convex problem, we instead use the nuclear norm as a convex relaxation. We conduct experiments with two different feature representations of the outputs and show that integrating an output feature representation in the structured prediction model leads to better overall predictions. We also conclude from our results that the best output representation is different for each argument type.

Left image
Our Approach (Generated Tuples):
  <act=puppy, pre=sit, loc=house>
  <act=dog, pre=sit, loc=bed>
Top 5 sentences generated using CNNs:
  a man and woman laying down to the couch with a bed.
  a man laying on top of a sofa with his dog.
  a man laying on the couch with his dog.
  a man is laying on the bed of a sofa.
  a man and dog bed in the back of a sofa.

Right image
Our Approach (Generated Tuples):
  <act=boy, pre=sit, loc=street>
  <act=boy, pre=sit, loc=soccer>
Top 5 sentences generated using CNNs:
  a young woman wearing a pink and standing in front of a metal shoe.
  a boy in blue shirt and shorts is holding a sword to pick up two children in a batting position.
  a young boy holding something in a batting cage.
  a boy standing in a gym with a toy gun.
  a little boy is holding a basketball, looking at the ground.

Figure 1: Automatic Tuple Generation. The proposed approach allows generating semantic tuples that have not been jointly observed before. For instance, in the left test image, the joint tuples ⟨puppy, sit, house⟩ and ⟨dog, sit, bed⟩ are not present in the training set, but our compositional approach can generate them.


Figure 2: Overview of our approach.


2 Semantic Tuple Image Annotation

2.1 Task 

We will address the task of predicting semantic tuples for images. Following [?], we will focus on a simple semantic representation that considers three basic arguments: predicate, actors and locative. For example, an image might be annotated with the semantic tuples ⟨run, dog, park⟩ and ⟨play, dog, grass⟩. We call each field of a tuple an argument. For example, in the tuple $t$ = ⟨play, dog, grass⟩, “play” is the argument of the predicate field, “dog” is the argument of the actor field and “grass” is the argument of the locative field.

Given this representation, we can formally define our problem as that of learning a function $\theta : X \times P \times A \times L \to \mathbb{R}$ that scores the compatibility between images and semantic tuples. Here, $X$ is the space of images, $P$ is a discrete set of predicate arguments, $A$ is a set of actor arguments and $L$ is a set of locative arguments. We are particularly interested in cases where $|P|$, $|A|$ and $|L|$ are reasonably large. We will use $T = P \times A \times L$ to refer to the set of possible tuples, and denote by $\langle p\, a\, l \rangle$ a specific instance of the tuple. To learn this function we are provided with a training set $D$. Each example in this set consists of an image $x$ and a set of corresponding semantic tuples $\{t_c\}$ which describe the events occurring in the image. Our goal is to use $D$ to learn a model for the conditional probability of a tuple given an image. We will use this model to predict semantic tuples for test images by computing the tuples that have the highest conditional probability according to our learnt model.

2.2 Dataset 

While some datasets of images associated with semantic tuples are already available, they only consider small state spaces for each argument type. To address this limitation we decided to create a new dataset of images annotated with semantic tuples. In contrast to previous datasets, we consider a more realistic range of possible argument values. In addition, our dataset has the advantage that every image is annotated with both the underlying semantics in the form of semantic tuples and natural language captions that constitute different lexical realizations of the same underlying semantics. To create our dataset we used a subset of the Flickr8k dataset, proposed in Hodosh et al. [?]. This dataset consists of 8,000 images taken from Flickr of people and animals performing some action, with five crowd-sourced descriptive captions for each one. These captions are meant to be concrete descriptions of what can be seen in the image rather than abstract or conceptual descriptions of non-visible elements (e.g., people or street names, or the mood of the image). This type of language is also known as Visually Descriptive Language [?].

We asked human annotators to annotate 1,544 image captions, corresponding to 311 images (approximately one third of the development set), producing more than 2,000 semantic tuples of predicates, actors and locatives. Annotators were required to annotate every caption with its corresponding semantic tuples without looking at the referent image. We do this to ensure an alignment between the information contained in the captions and their corresponding semantic tuples. Captions are annotated with tuples that consist of a predicate, a patient, an agent and a locative (indeed the patient, the agent and the locative could themselves consist of multiple arguments, but for simplicity we regard them as single arguments). For example, the caption “A brown dog is playing and holding a ball in a crowded park” will have the associated tuples ⟨predicate = play, agent = dog, patient = null, locative = park⟩ and ⟨predicate = hold, agent = dog, patient = ball, locative = park⟩. Notice that while these annotations are similar to PropBank-style semantic role annotations, there are also some differences. First, we do not annotate atomic sentences but captions that might actually consist of multiple sentences. Second, the annotation is done at the highest semantic level and annotators are allowed to make logical inferences to resolve the arguments of a predicate. For example, we would annotate the caption “A man is standing on the street. He is holding a camera” with ⟨predicate = standing, agent = man, patient = null, locative = street⟩ and ⟨predicate = hold, agent = man, patient = null, locative = street⟩. Figure 3 shows two sample images with captions and annotated semantic tuples. For the experiments we partitioned the set of 311 images (and their corresponding captions and tuples) into a training set of 150 images, a validation set of 50 (used to adjust parameters) and a test set of 100 images.

To enlarge the manually annotated dataset we first used the data of captions paired with semantic tuples to train a model that can predict semantic tuples from image captions. Similar to previous work, we start by computing several linguistic features of the captions, ranging from shallow part-of-speech tags to dependency parsing and semantic role labeling.¹ We extract the predicates by looking at the words tagged as verbs by the POS tagger. Then, the extraction of arguments for each predicate is resolved as a classification problem. More specifically, for each detected predicate in a sentence we regard each noun as a positive or negative training example of a given relation depending on whether the candidate noun is or is not an argument of the predicate. We use these examples to train a discriminative classifier that decides if a candidate noun is or is not an argument of a given predicate in a given sentence. This classifier exploits several linguistic features computed over the syntactic path of the dependency tree connecting the candidate noun and the predicate. As a classifier we trained a linear SVM. We ran the learnt tuple predictor on all the remaining 6,000 training images and corresponding captions of the Flickr8k dataset and produced a larger dataset of images paired with semantic tuples.²

Image 1
Keywords: guy, man, ride, rollerblade, pole, night
Sentential Descriptions:
  A guy on inline skates with a white hat on a yellow rail
  A male skater is riding a yellow rail.
  A man rollerblades across a yellow pole at night.
  A inline skater boy balances on a yellow rail.
  A skater does a trick on a yellow handrail.
Semantic Tuples:
  <act=guy, pre=be, loc=rail>
  <act=skater, pre=ride, loc=rail>
  <act=man, pre=rollerblade, loc=pole>
  <act=man, pre=rollerblade, loc=night>
  <act=skater, pre=do, loc=handrail>

Image 2
Keywords: dog, water, stick, run, play
Sentential Descriptions:
  A black dog chases a brown dog with a stick through the water.
  A brown dog is running through water with a stick in its mouth.
  Two dogs playing in the water with a stick.
  Two dogs playing with a stick in the water.
  Two dogs running through the water with a stick.
Semantic Tuples:
  <act=dog, pre=chase, loc=water>
  <act=dog, pre=run, loc=water>
  <act=dog, pre=play, loc=water>
  <act=stick, pre=play, loc=water>
  <act=dog, pre=run, loc=water>

Figure 3: Sample images, keywords, sentences and semantic tuples from the augmented Flickr-8K dataset.


¹ We use the linguistic analyzer of [?].

² In the experimental section we actually build models to predict coarser triplets that consist of a locative, a predicate and an actor. To convert from the finer ⟨predicate, agent, patient, locative⟩ annotations to the coarser ⟨predicate, actor, locative⟩ annotations we simply map the finer annotation to two coarser tuple annotations, one tuple for the agent and one tuple for the patient.


3 Incorporating Output Feature Representations into a Factorized Linear Model

For simplicity we will consider factorized sequence models over sequences of fixed length. However, all the ideas we present can be easily generalized to other structured prediction settings. In this section we first describe the general model and learning algorithm (Sections 3.1 and 3.2 respectively), and then, in Section 3.3, we focus on the specific problem of learning tuples given input images.


3.1 Bilinear Models with Output Feature Representations

Let $x$ be an input, and let $y = [y_1 \cdots y_T]$ be some output sequence where $y_i \in L$ for some set of states $L$. We are interested in learning a model that computes $P(y \mid x)$, i.e., the conditional probability of a sequence $y$ given some input $x$. We will consider CRF-like factorized log-linear models that take the form:

$P(y \mid x) = \dfrac{\exp \theta(x, y)}{\sum_{y'} \exp \theta(x, y')}$  (1)


The scoring function $\theta(x, y)$ is modeled as a sum of unary and binary bilinear potentials and is defined as:

$\theta(x, y) = \sum_{t=1}^{T} v_{y_t}^\top W_t\, \phi(x, t) + \sum_{t=1}^{T-1} v_{y_t}^\top Z_t\, v_{y_{t+1}}$  (2)

where $v_y \in \mathbb{R}^n$ is a feature representation of label $y \in L$, and $\phi(x, t) \in \mathbb{R}^d$ is a feature representation of the $t$-th input factor of $x$.
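To make Eqs. 1 and 2 concrete, the following is a minimal sketch of the scoring function and of the resulting conditional probability, assuming NumPy. The function names, the array layout and the brute-force normalization are assumptions made for illustration only; they are not taken from our implementation, and the enumeration is feasible only for toy problems.

```python
import itertools
import numpy as np

def sequence_score(V, W, Z, Phi, y):
    """Score theta(x, y): sum of unary and binary bilinear potentials.

    V   : (num_labels, n) array; row V[k] is the feature vector of label k.
    W   : list of T arrays of shape (n, d), one unary matrix per position.
    Z   : list of T-1 arrays of shape (n, n), one binary matrix per adjacent pair.
    Phi : (T, d) array; row Phi[t] is the feature vector of input factor t.
    y   : sequence of T label indices.
    """
    T = len(y)
    unary = sum(V[y[t]] @ W[t] @ Phi[t] for t in range(T))
    binary = sum(V[y[t]] @ Z[t] @ V[y[t + 1]] for t in range(T - 1))
    return float(unary + binary)

def conditional_probability(V, W, Z, Phi, y):
    """P(y | x) obtained by enumerating every output sequence (toy sizes only)."""
    T, num_labels = len(y), V.shape[0]
    scores = {cand: sequence_score(V, W, Z, Phi, cand)
              for cand in itertools.product(range(num_labels), repeat=T)}
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scores.values())
    log_norm = m + np.log(sum(np.exp(s - m) for s in scores.values()))
    return float(np.exp(scores[tuple(y)] - log_norm))
```

In practice the normalizer would be computed with dynamic programming over the chain rather than by enumeration.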

The first set of terms in the above equation are usually referred to as unary potentials and measure the compatibility between a single state at $t$ and the feature representation of input factor $t$. The second set of terms are the binary potentials and measure the compatibility between pairs of states at adjacent factors. The scoring function $\theta(x, y)$ is fully parameterized by the unary parameter matrices $W \in \mathbb{R}^{n \times d}$ and the binary parameter matrices $Z \in \mathbb{R}^{n \times n}$.

We will later describe the actual label feature representations that we used in our experiments. For now, it suffices to say that the main idea is to define a feature space so that semantically similar labels will be close in that space. As in the multilabel scenario [?], having full feature representations for arguments will allow us to share information across different classes.

One of the most important advantages of using feature representations for the outputs is that they give us the ability to generalize better. This is because with a good output feature representation, our model should be able to make sensible predictions about pairs of arguments that were not observed at training. This is easy to see: consider a case where we have a pair of arguments represented with feature vectors $a_1$ and $a_2$, and suppose that we have not observed the factor $a_1, a_2$ in our training data but we have observed the factor $b_1, b_2$. Then if $a_1$ is close in the feature space to argument $b_1$ and $a_2$ is close to $b_2$, our model will predict that $a_1$ and $a_2$ are compatible. That is, it will assign probability to the pair of arguments $a_1, a_2$, which seems a natural generalization from the observed training data.

This kind of representation also has an interesting interpretation in terms of the ranks of $W$ and $Z$. Let $W = U \Sigma V$ be the singular value decomposition of $W$. We can then write the unary potential $v_y^\top W \phi(x, t)$ as:

$v_y^\top U \Sigma\, [V \phi(x, t)]$.  (3)

Thus, we can regard the bilinear form as a function computing a weighted inner product between some real embedding $v_y^\top U$ representing state $y$, and some real embedding $[V \phi(x, t)]$ representing input factor $t$. The rank of $W$ gives us the intrinsic dimensionality of the embedding. Therefore, if we seek to induce shared low-dimensional embeddings across different states it seems reasonable to impose a low-rank penalty on $W$.

Inputs: $D$, $\eta$, $\gamma$, $c$
Output: $W$
Initialize $W = 0$
while $t <$ MaxIter do
    $G_t = \partial\, \mathrm{Loss}(D, \{W\}) / \partial W$;
    $W_{t+0.5} = W_t - \nu_t G_t$;  // $\nu_t$ is the learning rate
    $W_{t+0.5} = U \Sigma V^\top$;
    $\forall$ unary potentials define a diagonal matrix $\Sigma'$ such that: $\sigma'_i = \max[\sigma_i - \nu_t \eta, 0]$;
    $W_{t+1} = U \Sigma' V^\top$;
    $\forall$ binary potentials define a diagonal matrix $\Sigma'$ such that: $\sigma'_i = \max[\sigma_i - \nu_t \gamma, 0]$;
    $W_{t+1} = U \Sigma' V^\top$;
end

Algorithm 1: Learning Algorithm

Similarly, let $Z = U \Sigma V$ be the singular value decomposition of $Z$. We can write the binary potentials $v_y^\top Z v_{y'}$ as:

$v_y^\top U \Sigma V v_{y'}$  (4)

and thus the binary potentials compute a weighted inner product between a real embedding of state $y$ and a real embedding of state $y'$. Again, the rank of $Z$ gives us the intrinsic dimensionality of the embedding and, to induce a low-dimensional embedding for binary potentials, we will impose a low-rank penalty on $Z$. In practice, imposing low-rank constraints would lead to a hard optimization problem, so instead we will use the nuclear norm as a convex relaxation of the rank function.
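The embedding interpretation of Eq. 3 can be checked numerically. The sketch below, assuming NumPy and toy dimensions chosen only for illustration, builds a rank-$k$ matrix $W$ and verifies that the bilinear potential equals a weighted inner product between a $k$-dimensional label embedding and a $k$-dimensional input embedding. Note that `numpy.linalg.svd` returns the right factor already transposed, which plays the role of $V$ in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 8, 2            # label feature dim, input feature dim, rank (toy values)

# A rank-k unary parameter matrix W of shape (n, d).
W = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
v_y = rng.normal(size=n)     # feature vector of some label y
phi = rng.normal(size=d)     # feature vector of some input factor

# W = U @ diag(s) @ Vt, with the singular values s sorted in decreasing order.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

label_embedding = v_y @ U[:, :k]   # k-dimensional embedding of the label
input_embedding = Vt[:k] @ phi     # k-dimensional embedding of the input

direct = v_y @ W @ phi
via_embeddings = np.sum(label_embedding * s[:k] * input_embedding)
```

Only the first $k$ singular directions are needed because the remaining singular values of a rank-$k$ matrix are zero (up to floating-point error).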

3.2 Learning Algorithm 

After having described the type of scoring functions we are interested in, we now turn our attention to the learning problem. That is, given a training set $D = \{\langle x, y \rangle\}$ of pairs of inputs $x$ and output sequences $y$, we need to learn the parameters $\{W\}$ and $\{Z\}$. For this purpose we will do standard maximum-likelihood estimation and find the parameters that minimize the conditional negative log-likelihood of the data in $D$. That is, we will find the $\{W\}$ and $\{Z\}$ that minimize the following loss function $\mathrm{Loss}(D, \{W\}, \{Z\})$:

$-\sum_{\langle x, y \rangle \in D} \log P(y \mid x; \{W\}, \{Z\})$

It can be shown that this loss function is convex in $\{W\}$ and $\{Z\}$ whenever $\theta(x, y; \{W\}, \{Z\})$ is linear in the parameters, which is the case for our scoring function.

Recall that we are interested in learning low-rank unary and binary potentials. To this end we follow the standard approach, which is to use the nuclear norms $\|W\|_*$ and $\|Z\|_*$ (i.e., the $\ell_1$ norm of the singular values) as a convex approximation of the rank function. Putting all this together, the final optimization problem becomes:

$\min\ \mathrm{Loss}(D, \{W\}, \{Z\}) + c_1 \sum_t \|W_t\|_* + c_2 \sum_t \|Z_t\|_*$  (5)

where $\mathrm{Loss}(D, \{W\}, \{Z\}) = \sum_{d \in D} \mathrm{Loss}(d, \{W\}, \{Z\})$ is the negative log-likelihood function and $c_1$ and $c_2$ are two constants that control the trade-off between minimizing the loss and the implicit dimensionality of the embeddings.

In recent years, many algorithms have been proposed for optimizing trace norm regularized problems (e.g., see [?]). We use a simple optimization scheme known as Forward Backward Splitting, or FOBOS [?]. It can be shown that FOBOS converges to the global optimum at a $O(1/\epsilon^2)$ rate.

The main steps of the optimization involve computing the gradient of the loss function and performing a singular value decomposition on each $W$ and $Z$. In our case, computing the gradient involves computing marginal probabilities for unary and binary potentials, which has a cost of $O(|L|^2)$; to this we must add the cost of the SVD computation for each $W$ in $\{W\}$ and each $Z$ in $\{Z\}$.
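One iteration of this scheme can be sketched as follows, assuming NumPy. The function names `nuclear_prox` and `fobos_step` are hypothetical, and the gradient of the negative log-likelihood is taken as given (its computation from the marginals is omitted). The backward step soft-thresholds the singular values, which is the proximal operator of the nuclear norm.

```python
import numpy as np

def nuclear_prox(M, threshold):
    """Shrink every singular value of M by `threshold`, clipping at zero."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - threshold, 0.0)
    # Scaling the columns of U by s_shrunk is the same as U @ diag(s_shrunk).
    return (U * s_shrunk) @ Vt

def fobos_step(W, grad, learning_rate, reg_constant):
    """Forward step (gradient descent on the loss), then backward (proximal) step."""
    W_half = W - learning_rate * grad
    return nuclear_prox(W_half, learning_rate * reg_constant)
```

Singular values that fall below the threshold are set exactly to zero, so the rank of the parameter matrix can decrease from one iteration to the next.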

3.3 Bilinear CRF for Predicate Prediction 

For our task we will consider a simple factorized scoring function $\theta(x, \langle p\, a\, l \rangle)$ that has one unary term per argument type, and binary factors associated with the locative-predicate pair and with the predicate-actor pair. Since this corresponds to a chain structure, $\arg\max_{\langle p\, a\, l \rangle \in T} \theta(x, \langle p\, a\, l \rangle)$ can be efficiently computed using Viterbi decoding in time $O(N^2)$, where $N = \max(|P|, |A|, |L|)$. Similarly, we can also find the top $k$ predictions in $O(kN^2)$. Alternatively, we could have defined the relationship between arguments via a fully connected graph and used approximate inference methods.
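For the three-node chain locative, predicate, actor, the exact decoding step can be sketched as below, assuming NumPy and assuming that the unary and binary potentials have already been evaluated into score arrays (the function name `best_tuple` and the argument layout are hypothetical). Because the predicate is the middle node, the max-sum recursion that Viterbi performs reduces to sending one message from each end of the chain into the predicate.

```python
import numpy as np

def best_tuple(unary_loc, unary_pre, unary_act, pair_loc_pre, pair_pre_act):
    """Return the highest-scoring (p, a, l) index triple and its score.

    unary_loc    : (n_loc,) scores of each locative given the image.
    unary_pre    : (n_pre,) scores of each predicate given the image.
    unary_act    : (n_act,) scores of each actor given the image.
    pair_loc_pre : (n_loc, n_pre) locative-predicate compatibility scores.
    pair_pre_act : (n_pre, n_act) predicate-actor compatibility scores.
    """
    # Candidate scores flowing into the predicate node from each end of the chain.
    from_loc = unary_loc[:, None] + pair_loc_pre     # shape (n_loc, n_pre)
    from_act = unary_act[None, :] + pair_pre_act     # shape (n_pre, n_act)
    # Best total score for every predicate, maximizing out locative and actor.
    total = unary_pre + from_loc.max(axis=0) + from_act.max(axis=1)
    p = int(np.argmax(total))
    l = int(np.argmax(from_loc[:, p]))
    a = int(np.argmax(from_act[p]))
    return (p, a, l), float(total[p])
```

Each message touches every entry of one pairwise score table once, which gives the quadratic cost stated above.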

More specifically, the scoring function of the bilinear CRF we contemplate takes the form:

$\theta(x, \langle p\, a\, l \rangle) = \lambda_{loc}(x)^\top W_{loc}\, \phi_{loc}(l)$
    $+\ \lambda_{pre}(x)^\top W_{pre}\, \phi_{pre}(p)$
    $+\ \lambda_{act}(x)^\top W_{act}\, \phi_{act}(a)$
    $+\ \phi_{loc}(l)^\top W^{loc}_{pre}\, \phi_{pre}(p)$
    $+\ \phi_{pre}(p)^\top W^{pre}_{act}\, \phi_{act}(a)$  (6)

where the $\lambda$'s are the image representations and the $\phi$'s the textual ones. The unary potentials (first three terms in Eq. 6) measure the compatibility between image and semantic arguments; the first binary potential measures the compatibility between the semantic representations of locatives and predicates, and the second binary potential measures the compatibility between predicates and actors. The scoring function is fully parameterized by the unary parameter matrices $W_{loc} \in \mathbb{R}^{d \times n_l}$, $W_{pre} \in \mathbb{R}^{d \times n_p}$ and $W_{act} \in \mathbb{R}^{d \times n_a}$, and by the binary parameter matrices $W^{loc}_{pre} \in \mathbb{R}^{n_l \times n_p}$ and $W^{pre}_{act} \in \mathbb{R}^{n_p \times n_a}$. The parameters $n_l$, $n_p$ and $n_a$ are the dimensionalities of the feature representations for the locatives, predicates and actors.

Note that if we let the argument representation $\phi(r)$ be an indicator vector (with one dimension per possible argument value), we obtain the usual parametrization of a standard factorized linear model:

$\theta(x, \langle p\, a\, l \rangle) = \lambda_{loc}(x)^\top w_{loc}^{l}$
    $+\ \lambda_{pre}(x)^\top w_{pre}^{p}$
    $+\ \lambda_{act}(x)^\top w_{act}^{a}$
    $+\ w^{loc}_{pre}(l, p) + w^{pre}_{act}(p, a)$

As in the multilabel scenario [?], having full feature representations for arguments instead of indicator vectors will allow us to share information across different classes. In fact, we will use the model that uses indicator vectors as a baseline in our experiments.
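The reduction to the standard model can be verified with a small numerical sketch, assuming NumPy; the sizes and variable names are toy assumptions. With one-hot argument representations, a bilinear unary potential selects one column of the parameter matrix (a per-class weight vector), and a bilinear binary potential selects one entry of the parameter matrix (a per-pair weight).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_loc, n_pre = 5, 4, 3                     # toy sizes
lam = rng.normal(size=d)                      # image representation lambda(x)
W_loc = rng.normal(size=(d, n_loc))           # unary matrix for locatives
W_loc_pre = rng.normal(size=(n_loc, n_pre))   # binary locative-predicate matrix

def one_hot(index, size):
    """Indicator vector with a single 1 at position `index`."""
    e = np.zeros(size)
    e[index] = 1.0
    return e

l, p = 2, 1
unary_bilinear = lam @ W_loc @ one_hot(l, n_loc)
unary_linear = lam @ W_loc[:, l]              # column l is the weight vector of class l
binary_bilinear = one_hot(l, n_loc) @ W_loc_pre @ one_hot(p, n_pre)
binary_table = W_loc_pre[l, p]                # entry (l, p) is the weight of the pair
```

With indicator vectors no two classes share any parameter, which is why this baseline cannot transfer information between semantically related arguments.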
4 Representing Semantic Arguments

Recall that in order to handle the large number of possible arguments per field (i.e., data sparsity) our model assumes the existence of some feature representation for each argument and type: $\phi_{pre}(p) \in \mathbb{R}^{n_p}$, $\phi_{act}(a) \in \mathbb{R}^{n_a}$ and $\phi_{loc}(l) \in \mathbb{R}^{n_l}$. By learning an embedding of these vectors we will be able to share information across different classes. Intuitively, the feature vectors should describe properties of the arguments and should be defined so that feature vectors that are close to each other represent arguments that are semantically similar.

We will conduct experiments with two different feature representations: 1) fully unsupervised Skip-Gram based Continuous Word Representations (SCWR) and 2) a feature representation computed using the ⟨caption, semantic-tuples⟩ pairs, which we call Semantic Equivalence Representation (SER). We next describe each of these representations in more detail.

4.1 Semantic Equivalence Representation 

We want to exploit the dataset of captions paired with semantic tuples to induce a useful feature representation for arguments. To do so, we exploit the fact that any pair of semantic tuples associated with the same image will likely be describing the same event. Thus, they are in essence different ways of lexicalizing the same underlying concept.

Let us look at a concrete example. Imagine that we have an image annotated with the tuples ⟨play, dog, water⟩ and ⟨play, dog, river⟩. Since both tuples describe the same image, it is quite likely that both “river” and “water” refer to the same real-world entity, i.e., “river” and “water” are ‘semantically equivalent’ for this image. Using this idea we build a representation $\phi_{loc}(i) \in \mathbb{R}^{|L|}$ where the $j$-th dimension corresponds to the number of times the argument $j$ has been semantically equivalent to argument $i$.

More precisely, we compute the probability that argument $j$ can be exchanged with argument $i$ as $[i,j]_{sr} / \sum_{j'} [i,j']_{sr}$, where $[i,j]_{sr}$ is the number of times that $i$ and $j$ have appeared as annotations of the same image and with the same other arguments. For example, for the actor arguments $[i,j]_{sr}$ represents the number of times that actor $i$ and actor $j$ have appeared with the same locative and predicate as descriptions of the same image. Here is a concrete example of the feature vector for the locative ‘water’ (we report the non-zero dimensions and their corresponding values): $\phi_{loc}(\text{water})$ = [air 0.03, beach 0.06, boat 0.03, canoe 0.03, dock 0.13, grass 0.06, kayak 0.06, lake 0.06, mud 0.03, ocean 0.16, platform 0.03, pond 0.06, puddle 0.1, rock 0.03, snow 0.03, tree 0.03, waterfall 0.03]. Thus, according to the computed representation, ‘water’ is semantically most similar to ‘ocean’.
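The counting procedure can be sketched as follows in plain Python. The function name `semantic_equivalence`, the input layout (one set of (predicate, actor, locative) triples per image) and the row normalization are assumptions made for illustration; in particular, normalizing each row to sum to one is our reading of the "probability of exchange" described above.

```python
from collections import defaultdict
from itertools import permutations

def semantic_equivalence(images, field):
    """Build sparse equivalence vectors for one argument type.

    images : iterable of collections of (predicate, actor, locative) tuples,
             one collection per image.
    field  : index of the argument type to represent (0, 1 or 2).
    Returns a dict mapping argument i to {argument j: probability}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for tuples in images:
        # Ordered pairs of distinct tuples annotating the same image.
        for t1, t2 in permutations(set(tuples), 2):
            same_context = all(t1[k] == t2[k] for k in range(3) if k != field)
            if same_context and t1[field] != t2[field]:
                counts[t1[field]][t2[field]] += 1
    # Normalize each row of counts so that it sums to one.
    return {i: {j: c / sum(row.values()) for j, c in row.items()}
            for i, row in counts.items()}
```

For instance, with the two tuples ⟨play, dog, water⟩ and ⟨play, dog, river⟩ on one image and `field=2`, the locatives "water" and "river" each receive a count for the other.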

4.2 Skip-Gram based Continuous Word Representations

Recently, there has been interest in learning word representations, which have proven useful for many structured prediction tasks [?]. We use continuous word representations (also known as distributed representations) to tailor a task-specific embedding. Continuous word representations consist of neural network-based low-dimensional real-valued vectors for each word. We use the skip-gram based approach of [?] for inducing continuous word representations. Skip-gram based representations are essentially a single-layer neural network, and are based on inner products between two word vectors. The objective in a skip-gram model is to predict a word's context given the word itself. In our experiments we use the publicly available³ continuous word representations trained on the Google News dataset (100 billion words).

5 Related Work 

In recent years, some works have tackled the problem of generating rich textual descriptions of images. One of the pioneers is [?], where a CRF model combines the output of several vision systems to produce input for a language generation method. This seminal work, however, only considered a limited set of a few tens of labels, while we aim at dealing with potentially hundreds of labels simultaneously. In [?], the authors find the similarity between sentences and images in a “meaning” space, represented by semantic tuples which are very similar to ours: triplets of object, action and scene. The main difference with this work is that it uses a rule-based system to extract semantic tuples from dependency trees, whereas we train a model that predicts semantic tuples and, most importantly, it uses a standard factorized linear model while we propose a model that leverages feature representations of arguments, and can therefore handle significantly larger state spaces.

³ https://code.google.com/p/word2vec/

Other works focus on the simplified problem of ranking human-generated captions for images. In [?] the authors propose to use Kernel Canonical Correlation Analysis to project images and their captions into a joint representation space, in which images and captions can be related and ranked to perform illustration and annotation tasks. However, the system cannot be used to generate novel image descriptions for new images and, since a kernel is necessary, it has limitations on the number of image/caption pairs that can be used to define the subspace. In a follow-up work, the authors address improving the text/image embeddings with abundant weakly-annotated data from Flickr and similar sites using a stacked representation [?]. To cope with the large amounts of data, Normalized Canonical Correlation Analysis is used. Socher et al. [18] also address the ranking of images given a sentence and vice-versa using a common subspace, also known as zero-shot learning. Recursive Neural Networks are used to learn this common representation. The work of [?] performs natural text generation from images using a bank of detectors to find objects and compressing the text to retrieve ‘generalizable’ small fragments. On top of this, a tree approach is used to construct sentences given the observations and fragments. However, the sentences produced this way can be easily corrupted by wrongly retrieved segments.

Recent works use deep networks to address the problem: [?] propose a pure deep network approach, where a convolutional neural network is used to extract image features and a recursive deep network is used to generate the text. The system is trained to maximize likelihood end-to-end. [?] use a common multi-modal embedding to align text and images, and a recurrent neural network is trained to generate sentences directly from the image pixels. Although these methods report good results in terms of BLEU score agreement with gold captions, they do not model the underlying visual predicates, which is the goal of this paper.

The use of label embeddings and their combination with bilinear forms has been previously proposed in the context of multiclass and multilabel image classification [?], but to the best of our knowledge there is no previous work on leveraging output embeddings in the context of structured prediction. Thus, besides the concrete application to semantic tuple image generation, this paper presents a useful modeling tool for handling structured prediction problems in large state spaces. Our model can be used whenever we have some means of computing a feature representation of the outputs.

6 Experiments 

As is standard practice, in order to compute image representations ($\lambda$-vectors in Eq. 6), we use the 4,096-dimensional second-to-last layer of a Convolutional Neural Network (CNN). The full network has 5 convolutional layers followed by 3 fully connected layers, and obtained the best performance in the ILSVRC-2012 challenge. The network is trained on a subset of ImageNet [?] to classify 1,000 different classes and we use the publicly available implementation and pre-trained model provided by [?]. The features obtained with this procedure have been shown to generalize well and outperform traditional hand-crafted features, and thus they are already being used in a wide diversity of tasks [?].

To test our method we used the 100 test images that were annotated with ground-truth semantic tuples. For locatives, predicates and actors we consider the 400 most frequent. To measure performance we first compute the top 5 tuples for each image. Then, we define the set of predicted locatives to be the union of all predicted locatives and we do the same for the other argument types. Finally, we compute the precision for each type; for example, for the locatives this is the percentage of predicted locatives that were present in the gold tuples for the corresponding image.
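This evaluation measure can be sketched in a few lines of Python. The function name `argument_precision` and the dictionary-based input layout are hypothetical, and pooling the counts over all test images (rather than averaging per-image precisions) is an assumption, since the description above does not fix this detail.

```python
def argument_precision(predicted, gold, field):
    """Percentage of predicted arguments of one type that appear in the gold tuples.

    predicted, gold : dicts mapping an image id to a list of
                      (predicate, actor, locative) tuples.
    field           : index of the argument type to evaluate (0, 1 or 2).
    """
    correct = total = 0
    for image_id, tuples in predicted.items():
        # Union of the arguments of this type over the top predicted tuples.
        predicted_args = {t[field] for t in tuples}
        gold_args = {t[field] for t in gold.get(image_id, [])}
        correct += len(predicted_args & gold_args)
        total += len(predicted_args)
    return 100.0 * correct / total if total else 0.0
```

Taking the union first means that an argument repeated across several of the top 5 tuples is counted only once per image.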

Sample image (top-right panel), top predicted tuples:
  <act=people, pre=perform, loc=air>
  <act=people, pre=jump, loc=air>
  <act=people, pre=wear, loc=air>
  <act=people, pre=watch, loc=air>
  <act=people, pre=perform, loc=pool>
  <act=people, pre=sit, loc=air>
  <act=people, pre=gather, loc=air>
Training sentences:
  A guy is doing a skateboard trick in front of a crowd
  A man is skateboarding in front of a group of people.
  A skateboarder performs a trick in front of a large crowd.
  A skateboarder leaping from a pool in front of a crowd.
  Skateboarder does tricks in front of crowd while photographer watches
Other tuples shown: <act=man, pre=ride, loc=street>, <act=girl, pre=sit, loc=pool>, <act=people, pre=sit, loc=camera>
Error panel labels: Incorrect actor; Incorrect actor; Incorrect actor & action; Incorrect actor & locative

Figure 4: Samples of predicted tuples. Top-left: Examples of visually correct predictions. Bottom: Typical errors on one or several arguments. Top-right: Sample image and its top predicted tuples. The tuples in blue were observed neither in the SP-Dataset nor in the automatically enlarged dataset. Note that all of them are descriptive of what is occurring in the scene.

Figure 5: Performance as a function of the size of the intrinsic embedded space for predicate (left) and locative (right) arguments.

The regularization parameters of each model were set using the validation set. We compare the performance of several models:

• Baseline KCCA: This model implements the Kernel Canonical Correlation Analysis approach of [?]. We first note that this approach is able to rank a list of candidate captions but cannot directly generate tuples. To generate tuples for test images we first find the caption in the training set that has the highest ranking score for that image and then extract the corresponding semantic tuples from that caption. These are the tuples that we consider as predictions of the KCCA model.
• Baseline Separate Predictors (SPred): We also consider a baseline made of independent predictors for each argument type. More specifically, we train one-vs-all SVMs (we also tried multi-class SVMs but they did not improve performance) to independently predict locatives, predicates and actors. For each argument type and candidate label we have a score computed by the corresponding SVM. Given an image we generate the top tuples that maximize the sum of scores for each argument type.

• Embedded CRF with Indicator Features (IND): This is a standard factorized log-linear model that does not use any feature representation for the outputs.

• Embedded CRF with a model that uses the skip-gram continuous word representation of outputs (SCWR).

• Embedded CRF with a model that uses the semantic equivalence representation of outputs (SER).

• A combined model that makes predictions using the best feature representation for each argument type (COMBO).

Table 1 reports the results for the baselines and for the different CRF schemes. The first observation is that the best performing output feature representation is different for each argument type. For the locatives the best representation is SER, for the predicates it is SCWR, and for the actors using an output feature representation causes a drop in performance. The largest improvement from using an output feature representation is obtained on the predicate arguments, where we improve by almost 10% over the indicator representation by using the skip-gram representation. Overall, the model that uses the best representation for each argument type performs better than the indicator baseline.

Finally, Figure 5 shows performance as a function of the dimensionality of the learnt embedding, i.e., the rank of the parameter matrices. As we can see, the learnt models are efficient in the sense that they can work well with low-dimensional projections of the features.

7 Conclusion 

In this paper we have presented a model for exploiting input and output embeddings in the context of structured prediction. We have applied this framework to the problem of predicting compositional semantic descriptions of images. Our results show the advantages of using output embeddings for handling large state spaces. We have also seen that by regularizing with the nuclear norm we can obtain computationally efficient low-rank models with comparable performance.

        SPred   KCCA   IND   SCWR   SER   COMBO
LOC     15      23     32    28     33    -
PRED    11      20     24    33     25    -
ACT     30      25     52    51     50    -
MEAN    18.6    22.6   36    37.3   36    39.3

Table 1: Precision of baselines and CRFs with different output embeddings.